Method for supervised teaching of a recurrent artificial neural network

ABSTRACT

A method for the supervised teaching of a recurrent neural network (RNN) is disclosed. A typical embodiment of the method utilizes a large (50 units or more), randomly initialized RNN with a globally stable dynamics. During the training period, the output units of this RNN are teacher-forced to follow the desired output signal. During this period, activations from all hidden units are recorded. At the end of the teaching period, these recorded data are used as input for a method which computes new weights for those connections that feed into the output units. The method is distinguished from existing training methods for RNNs by the following characteristics: (1) Only the weights of connections to output units are changed by learning; existing methods for teaching recurrent networks adjust all network weights. (2) The internal dynamics of large networks are used as a "reservoir" of dynamical components which are not changed, but only newly combined by the learning procedure; existing methods use small networks, whose internal dynamics are themselves completely re-shaped through learning.

TECHNICAL FIELD OF THE INVENTION

[0001] The present invention relates to the field of supervised teaching of recurrent neural networks.

BACKGROUND OF THE INVENTION

[0002] Artificial neural networks (ANNs) today provide many established methods for signal processing, control, prediction, and data modeling for complex nonlinear systems. The terminology for describing ANNs is fairly standardized. However, a brief review of the basic ideas and terminology is provided here.

[0003] A typical ANN consists of a finite number K of units, which at a discrete time t (where t=1,2,3 . . . ) have an activation x_(i)(t) (i=1, . . . ,K). The units are mutually linked by connections with weights w_(ji) (where i, j=1, . . . ,K and where w_(ji) is the weight of the connection from the i-th to the j-th unit), which typically are assigned real numbers. A weight w_(ji)=0 indicates that there is no connection from the i-th to the j-th unit. It is convenient to collect the connection weights in a connection matrix W=(w_(ji))_(j,i=1, . . . ,K). The activation of the j-th unit at time t+1 is derived from the activations of all network units at time t by

x_(j)(t+1)=f_(j)(Σ_(i=1, . . . ,K) w_(ji)x_(i)(t))   (1)

[0004] where the transfer function f_(j) typically is a sigmoid-shaped function (linear or step functions are also relatively common). In most applications, all units have identical transfer functions. Sometimes it is beneficial to add noise to the activations. Then (1) becomes

x_(j)(t+1)=f_(j)(Σ_(i=1, . . . ,K) w_(ji)x_(i)(t))+v(t)   (1′)

[0005] where v(t) is an additive noise term.
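
As an illustration, a minimal code sketch of the update rules (1) and (1′) follows (Python with numpy is used for all code sketches in this document; all names such as W, x, and noise_amp are illustrative assumptions, not part of the claimed method).

```python
import numpy as np

def step(W, x, f=np.tanh, noise_amp=0.0):
    """One network update per Eq. (1'): x_j(t+1) = f_j(sum_i w_ji x_i(t)) + v(t).

    W: K-by-K connection matrix; x: activation vector at time t;
    f: transfer function (identical for all units in this sketch);
    noise_amp: amplitude of the additive noise v(t); noise_amp=0 recovers Eq. (1).
    """
    v = np.random.uniform(-noise_amp, noise_amp, size=x.shape)
    return f(W @ x) + v
```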

[0006] Some units are designated as output units; their activation is considered as the output of the ANN. Some other units may be assigned as input units; their activation x_(i)(t) is not computed according to (1) but is set to an externally given input u_(i)(t), i.e.

x _(i)(t)=u _(i)(t)   (2)

[0007] in the case of input units.

[0008] Most practical applications of ANNs use feedforward networks, in which activation patterns are propagated from an input layer through hidden layers to an output layer. The characteristic feature of feedforward networks is that there are no connection cycles. In formal theory, feedforward networks represent input-output functions. A typical way to construct a feedforward network for a given functionality is to teach it from a training sample, i.e. to present it with a number of correct input-output pairings, from which the network learns to approximately repeat the training sample and to generalize to other inputs not present in the training sample. Using a correct training sample is called supervised learning. The most widely used supervised teaching method for feedforward networks is the backpropagation algorithm, which incrementally reduces the quadratic output error on the training sample by a gradient descent on the network weights. The field had its breakthrough when efficient methods for computing the gradient became available, and is now an established and mature subdiscipline of pattern classification, control engineering and signal processing.

[0009] A particular variant of feedforward networks, radial basis function networks (RBF networks), can be used with a supervised learning method that is simpler and faster than backpropagation. (An introduction to RBF networks is given in the article "Radial basis function networks" by D. Lowe, in: Handbook of Brain Theory and Neural Networks, M. A. Arbib (ed.), MIT Press 1995, p. 779-782.) Typical RBF networks have a hidden layer whose activations are computed quite differently from (1). Namely, the activation of the j-th hidden unit is a function

g_(j)(∥u−v_(j)∥)   (3)

[0010] of the distance between the input vector u and some reference vector v_(j). The activation of output units follows the prescription (1), usually with a linear transfer function. In the teaching process, the activation mechanism for hidden units is not changed. Only the weights of hidden-to-output connections have to be changed in learning. This renders the learning task much simpler than in the case of backpropagation: the weights can be determined off-line (after presentation of the training sample) using linear regression methods, or can be adapted on-line using any variant of mean square error minimization, for instance variants of the least-mean-square (LMS) method.
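
For illustration, a hedged sketch of the off-line variant of this computation follows. The names are assumptions: H is an N-by-K matrix whose rows are the hidden-layer activations (3) for the N training inputs, and Y is the N-by-m matrix of desired outputs.

```python
import numpy as np

def rbf_output_weights(H, Y):
    """Hidden-to-output weights minimizing ||H W - Y||^2 by linear regression."""
    W, residuals, rank, sv = np.linalg.lstsq(H, Y, rcond=None)
    return W  # K-by-m weight matrix
```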

[0011] If one admits cyclic paths of connections, one obtains recurrent neural networks (RNNs). The hallmark of RNNs is that they can support self-exciting activation over time, and can process temporal input with memory influences. From a formal perspective, RNNs realize nonlinear dynamical systems (as opposed to feedforward networks, which realize functions). From an engineering perspective, RNNs are systems with a memory. It would be a significant benefit for engineering applications to construct RNNs that perform a desired input-output dynamics. However, such applications of RNNs are still rare. The major reason for this rareness lies in the difficulty of teaching RNNs. The state of the art in supervised RNN learning is marked by a number of variants of the backpropagation through time (BPTT) method. A recent overview is provided by A. F. Atiya and A. G. Parlos in the article "New Results on Recurrent Network Training: Unifying the Algorithms and Accelerating Convergence", IEEE Transactions on Neural Networks, vol. 11, No. 3 (2000), 697-709. The intuition behind BPTT is to unfold the recurrent network in time into a cascade of identical copies of itself, where recurrent connections are re-arranged such that they lead from one copy of the network to the next (instead of back into the same network). This "unfolded" network is, technically, a feedforward network and can be taught by suitable variants of teaching methods for feedforward networks. This way of teaching RNNs inherits the iterative, gradient-descent nature of standard backpropagation, and multiplies its intrinsic cost by the number of copies used in the "unfolding" scheme. Convergence is difficult to steer and often slow, and the single iteration steps are costly. Owing to the computational costs, only relatively small networks can be trained. Another difficulty is that the back-propagated gradient estimates quickly degrade in accuracy (going to zero or infinity), thereby precluding the learning of memory effects over timespans greater than approx. 10 timesteps. These and other difficulties have so far prevented RNNs from being widely used.

SUMMARY OF THE INVENTION

[0012] The invention is defined by the method of claim 1 and the network of claim 39. Individual embodiments of the invention are specified in the dependent claims.

[0013] The present invention presents a novel method for the supervised teaching of RNNs. The background intuitions behind this method are quite different from existing BPTT approaches. The latter try to meet the learning objective by adjusting every weight within the network, thereby attaining a minimal-size network in which every unit contributes maximally to the desired overall behavior. This leads to a small network that performs a particular task. By contrast, the method disclosed in the present invention utilizes a large recurrent network, whose internal weights (i.e. on hidden-to-hidden, input-to-hidden, or output-to-hidden connections) are not changed at all. Intuitively, the large, unchanged network is used as a rich "dynamical reservoir" of as many different nonlinear dynamics as there are hidden units. Another perspective on this reservoir network is to view it as an overcomplete basis. Only the hidden-to-output connection weights are adjusted in the teaching process. By this adjustment, the hidden-to-output connections acquire the functionality of a filter which distills and re-combines from the "reservoir" dynamical patterns in a way that realizes the desired learning objective.

[0014] A single instantiation of the "reservoir" network can be re-used for many tasks, by adding new output units and separately teaching their respective hidden-to-output weights for each task. After learning, arbitrarily many such tasks can be carried out in parallel, using the same single instantiation of the large "reservoir" network. Thereby, the overall cost of using an RNN set up and trained according to the present invention is greatly reduced in cases where many different tasks have to be carried out on the same input data. This occurs, e.g., when a signal has to be processed by several different filters.

[0015] The temporal memory length of RNNs trained with the method of the invention is superior to that achievable with existing methods. For instance, "short term memories" of about 100 time steps are easily achievable with networks of 400 units. Examples of this are described later in this document (Section on Examples).

[0016] The invention has two aspects: (a) architectural (structure of the RNN, its setup and initialization), and (b) procedural (teaching method). Both aspects are interdependent.

[0017] Dynamical Reservoir (DR)

[0018] According to one architectural aspect of the invention, there is provided a recurrent neural network whose weights are fixed and are not changed by subsequent learning. The function of this RNN is to serve as a "reservoir" of many different dynamical features, each of these being realized in the dynamics of the units of the network. Henceforward, this RNN will be called the dynamical reservoir, and abbreviated by DR.

[0019] Preferably, the DR is large, i.e. has on the order of 50 or more (no upper limit) units.

[0020] Preferably, the DR's spontaneous dynamics (with zero input) is globally stable, i.e. the DR converges to a unique stable state from every starting state.

[0021] In applications where the processed data has a spatial structuring (e.g., video images), the connectivity topology of the DR may also carry a spatial structure.

[0022] Input Presentation

[0023] According to another architectural aspect of the invention, n-dimensional input u(t) at time t (t=1,2,3 . . . ) is presented to the DR by any means such that the DR is induced by the input to exhibit a rich excited dynamics.

[0024] The particular way in which input is administered is of no concern for the method of the invention. Some possibilities which are traditionally used in the RNN field are now briefly mentioned.

[0025] Preferably, the input is fed into the DR by means of extra input units. The activations of such input units are set to the input u(t) according to Eq. (2). In cases where the input has a spatiotemporal character (e.g., video image sequences), the input units may be arranged in a particular spatial fashion ("input retina") and connected to the DR in a topology-preserving way. Details of how the weights of the input-to-DR connections are determined are given in the "detailed description of preferred embodiments" section.

[0026] Alternatively, input values can be fed directly as additive components to the activations of the units of the DR, with or without spatial structuring.

[0027] Alternatively, the input values can be coded before they are presented to the DR. For instance, spatial coding of numerical values can be employed.

[0028] Reading Out Output

[0029] According to another architectural aspect of the invention, m-dimensional output y(t) at time t is obtained from the DR by reading it out from the activations of m output units (where m≧1). By convention, the activations of the output units shall be denoted by y₁(t), . . . , y_(m)(t).

[0030] In a preferred embodiment of the invention, these output units are attached to the DR as extra units. In this case (i.e., extra output units), there may also be provided output-to-DR connections which feed back output unit activations into the DR network. Typically, no such feedback will be provided when the network is used as a passive device for signal processing (e.g., for pattern classification or for filtering). Typically, feedback connections will be provided when the network is used as an active signal generation device. Details of how to determine feedback weights are described in the "detailed description of preferred embodiments" section.

[0031] According to another architectural aspect of the invention, the activation update method for the m outputs y₁(t), . . . , y_(m)(t) is of the form given in equation (1), with transfer functions f₁, . . . , f_(m). The transfer functions f_(j) of output units typically will be chosen as sigmoids or as linear functions.

[0032] FIG. 1 provides an overview of a preferred embodiment of the invention, with extra input and output units. In this figure, the DR [1] is receiving input by means of extra input units [2] which feed input into the DR through input-to-DR connections [4]. Output is read out of the network by means of extra output units [3], which in the example of FIG. 1 also have output-to-DR feedback connections [7]. Input-to-DR connections [4] and output-to-DR feedback connections [7] are fixed and not changed by training. Finally, there are DR-to-output connections [5] and (possibly, but not necessarily) input-to-output connections [6]. The weights of these connections [5], [6] are adjusted during training.

[0033] Next, the procedural aspects of the invention (teaching method) are described. As with all supervised teaching methods for RNNs, it is assumed that a training sequence is given. The training sequence consists of two time series u(t) and ỹ(t), where t=1,2, . . . ,N. It is tacitly understood that in cases of online learning, N need not be determined at the outset of the learning; the learning procedure is then an open-ended adaptation process. u(t) is an n-dimensional input vector (where n≧0, i.e. the no-input case n=0 is also possible), and ỹ(t) is an m-dimensional output vector (with m≧1). The two time series u(t) and ỹ(t) represent the desired, to-be-learnt input-output behavior. As a special case, the input sequence u(t) may be absent; the learning task is then to learn a purely generative dynamics.

[0034] The training sequences u(t), ỹ(t) are presented to the network for t=1,2, . . . ,N. At every time step, the DR is updated (according to the chosen update law, e.g., Equation (1)), and the activations of the output units are set to the teacher signal ỹ(t) (teacher forcing).
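
A sketch of this teacher-forced run follows, under assumed names: W (DR weights), W_in (input-to-DR weights), W_back (output-to-DR feedback weights, zero if no feedback is provided), and U, Y_teach for the training sequences u(t), ỹ(t). It records the DR states for the subsequent weight computation.

```python
import numpy as np

def harvest_states(W, W_in, W_back, U, Y_teach, f=np.tanh):
    """Run the DR with teacher forcing and record its states."""
    x = np.zeros(W.shape[0])
    states = []
    for t in range(len(U) - 1):
        # DR update: internal dynamics + input + teacher-forced output feedback
        x = f(W @ x + W_in @ U[t] + W_back @ Y_teach[t])
        states.append(x.copy())
    return np.array(states)  # one row per time step
```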

[0035] The method of the invention accommodates both off-line learning and on-line learning.

[0036] In off-line learning, both the activation vector x(t) of non-output units and the teacher signal ỹ(t) are collected for t=1,2, . . . ,N. From these data, at time N there are calculated weights w_(ji) for connections leading into the output units, such that the mean square error

E[ε_(j)²]=(1/(N−1)) Σ_(t=1, . . . ,N−1) (f_(j)⁻¹(ỹ_(j)(t+1))−<w_(j),x(t)>)²   (4)

[0037] is minimized for every output unit j=1, . . . ,m over the training sequence data. In equation (4), <w_(j),x(t)> denotes the inner product

w_(j1)u₁(t)+ . . . +w_(jn)u_(n)(t)+w_(j,n+1)x₁(t)+ . . . +w_(j,n+K)x_(K)(t)+w_(j,n+K+1)y₁(t)+ . . . +w_(j,n+K+m)y_(m)(t),   (5)

[0038] this form of <w_(j),x(t)> being given if there are extra input units. The calculation of weights which minimize Eq. (4) is a standard problem of linear regression, and can be done with any of the well-known solution methods for this problem. Details are given in the Section "Detailed description of preferred embodiments".
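
As a minimal sketch of this regression (assuming the states recorded during teacher forcing are stacked as rows of a matrix X, and an invertible transfer function, e.g. f=tanh with inverse arctanh):

```python
import numpy as np

def output_weights(X, Y_teach, f_inv=np.arctanh):
    """Linear regression solution of Eq. (4).

    X: (N-1)-by-K matrix whose row t holds the state x(t);
    Y_teach: N-by-m matrix of teacher outputs y~(t).
    Regresses f^-1(y~(t+1)) on x(t); returns the m-by-K weight matrix.
    """
    targets = f_inv(Y_teach[1:])               # teacher shifted by one step
    W_out, *_ = np.linalg.lstsq(X, targets, rcond=None)
    return W_out.T
```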

[0039] The weights w_(ji) are the final result of the procedural part of the method of the invention. After setting these weights in the connections that feed into the output units, the network can be exploited.

[0040] In online-learning variants of the invention, the weights w_(j) are incrementally adapted. More precisely, for j=1, . . . ,m, the weights w_(j)(t) are updated at every time t₀=1,2, . . . ,N by a suitable application of any of the many well-known methods that adaptively and incrementally minimize the mean square error up to time t₀,

E[ε_(j)²(t₀)]=(1/(t₀−1)) Σ_(t=1, . . . ,t₀−1) (f_(j)⁻¹(ỹ_(j)(t+1))−<w_(j),x(t)>)².   (4a)

[0041] Adaptive methods that minimize this kind of error are known collectively under the name of "recursive least squares" (RLS) methods. Alternatively, from a statistical perspective one can also minimize the statistically expected square error

E[ε_(j)²]=E[(f_(j)⁻¹(ỹ_(j)(t+1))−<w_(j),x(t)>)²],   (4b)

[0042] where on the right-hand side E denotes statistical expectation. Adaptive methods that minimize (4b) are stochastic gradient descent methods, of which there are many, among them the most popular of all MSE minimization methods, the LMS method. However, the LMS method is not ideally suited to be used with the method of the invention. Details are given in the Section "Detailed description of preferred embodiments".
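
For concreteness, one step of a generic RLS update for a single output unit is sketched below. This is one of many textbook variants, and the names (w: weight vector; P: inverse correlation matrix estimate; lam: forgetting rate) are assumptions.

```python
import numpy as np

def rls_step(w, P, x, target, lam=0.9995):
    """One RLS update; target is f^-1(y~_j(t+1)), x the state vector x(t)."""
    Px = P @ x
    k = Px / (lam + x @ Px)          # gain vector
    err = target - w @ x             # a-priori error
    w_new = w + k * err
    P_new = (P - np.outer(k, Px)) / lam
    return w_new, P_new
```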

BRIEF DESCRIPTION OF THE FIGURES

[0043] The provided Figures are, with the exception of FIG. 1, illustrations of the examples described below. They are referenced in detail in the description of the examples. Here is an overview of the Figures.

[0044] FIG. 1 is a simplified overview of a preferred embodiment of the invention.

[0045] FIG. 2 shows various data sets obtained from the first example, a simplistic application of the method of the invention to obtain a sine generator network, which is reported for didactic reasons.

[0046] FIG. 3 shows various data sets obtained from the second example, an application of the method of the invention to obtain a short time memory network in the form of a delay line.

[0047] FIG. 4 shows the connectivity setup and various data sets obtained from the third example, an application of the method of the invention to obtain a model of an excitable medium trained from a single "soliton" teacher signal.

[0048] FIG. 5 shows various data sets obtained from the fourth example, an application of the method of the invention to learn a chaotic time series generator.

[0049] FIG. 6 illustrates the fifth example, by providing a schematic setup of a network applied to learning a state feedback tracking controller for a pendulum, and various data sets obtained in this example.

[0050] FIG. 7 shows various data sets obtained from the sixth example, an application of the method of the invention to learn a bidirectional device which can be used as a frequency meter or a frequency generator.

DESCRIPTION OF SOME EXAMPLES

[0051] Before the invention is described in detail in subsequent sections, it will be helpful to demonstrate the invention with some exemplary embodiments. The examples are selected to highlight different basic aspects of the invention.

Example 1

[0052] A Toy Example to Illustrate Some Basic Aspects of the Invention

[0053] This example demonstrates the basic aspects of the invention with a toy example. The task is to teach an RNN to generate a sine wave signal. Since this task is almost trivial, the size of the DR was selected to be only 20 units (for more interesting tasks, network sizes should be significantly greater).

[0054] First, it is shown how the network architecture was set up. The 20 units were randomly connected with a connectivity of 20%, i.e., on average every unit had connections with 4 other units (including possible self-connections). The connection weights were set randomly to either 0.5 or −0.5.

[0055] This network was left running freely. FIG. 2a shows a trace of 8 arbitrarily selected units in the asymptotic activity. It is apparent that all DR units are entrained in a low-amplitude oscillation.

[0056] According to the architectural aspects of the invention, an autonomous self-excitation of the DR is not desired. The DR's autonomous dynamics should be globally stable, i.e., converge to a stable all-zero state from any initial starting state. Therefore, the weights were decreased by a factor of 0.98, i.e., a weight that was previously 0.5 became 0.49. FIG. 2b shows a 200-step trace, recorded after 200 initial steps, of the network started in a random initial state. It is apparent that with the new weights the network's dynamics is globally stable, i.e. will asymptotically decay to all-zero activations.

[0057] This global stability is only marginal in the sense that a slight increase of weights would render the dynamics unstable (in this case, oscillation would set in by an increase of absolute weight values from 0.49 to 0.5). A marginal global stability in this sense is often the desired condition for the setup of the DR according to the invention.

[0058] Next, the response characteristics of the DR are probed. To this end, an extra input unit was attached. It was completely connected to the DR, i.e., a connection was established from the input unit to each of the 20 units of the DR. The connection weights were set to values randomly taken from the interval [−2, 2]. FIG. 2c shows the response of the network to a unit impulse signal given at time t=10. The first seven plots in FIG. 2c show activation traces of arbitrarily selected DR units. The last plot shows the input signal. It becomes apparent that the DR units show a rich variety of response dynamics. This is the desired condition for the setup of DRs according to the invention.

[0059] Next, the response of the DR network to a sine input was probed. Analogous to FIG. 2c, FIG. 2d shows the asymptotic response of seven DR units and the input signal. This Figure again emphasizes the rich variety of responses of DR units. Finally, the network was trained to generate the same sine signal that was administered previously as input. The extra unit that was previously used as input unit was left unchanged in its connections to the DR, but now was used as an output unit. Starting from an all-zero activation, the network was first run for 100 steps with teacher forcing to settle initial transients. Then, it was run another 500 steps with teacher forcing. The activation values of the 20 DR units were recorded for these 500 steps. At time t=600, an offline learning of weights from the DR to the output unit was performed, i.e., the DR-to-output weights were computed as the solutions of a linear regression of the desired output values on the DR states, minimizing the mean square error of Equation (4). Thereafter, teacher forcing was switched off, and the network was left to run freely for another 10,000 steps. After that, 50 steps were plotted to obtain FIG. 2e. Here, the eighth plot shows the activation of the output unit. Unsurprisingly, FIG. 2e is virtually the same as FIG. 2d. FIG. 2f shows a superposition of the output with the teacher signal (unknown to the network): teacher signal = solid line, network output = dashed line. The dashed line is identical to the solid line at the plotting resolution; in fact, the numerical value of the mean square error (4) was 1.03×10⁻¹³ for this (simple) learning task.

Example 2

[0060] A Short Time Memory

[0061] In this example it is shown how the method of the invention can be used to teach an RNN to produce delayed versions of the input.

[0062] The network was set up as in FIG. 1. The DR had a size of 100 units. It was randomly connected with a connectivity of 5%. Nonzero weights were set to +0.45 or −0.45 with equal probability. This resulted in a globally stable dynamics of the DR (again, of marginal stability: increasing the absolute values of weights to 0.475 would destroy global stability). The impulse responses of the DR's units to a unit impulse were qualitatively similar to the ones in example 1 (cf. FIG. 2c) and are not shown.

[0063] One input unit was attached to the DR, by connecting the input unit to every unit of the DR. Weights of these connections were randomly set to 0.001 or −0.001 with equal probability.

[0064] Furthermore, three extra output units were provided, with no output-to-DR feedback connections.

[0065] The learning task consisted in repeating in the output units the input signal with delays of 10, 20, 40 time steps. The input signal used was essentially a random walk with a banded, nonstationary frequency spectrum. FIG. 3a shows a 50-step sequence of the input (solid line) and the correct delayed signal (teacher signal) of delay 10 (dashed line).
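
A sketch of how the teacher signals for this task can be derived from the input sequence (names illustrative; entries before a delay has elapsed are simply zero-padded here):

```python
import numpy as np

def delayed_targets(u, delays=(10, 20, 40)):
    """Teacher outputs: the input signal shifted by the given delays."""
    Y = np.zeros((len(u), len(delays)))
    for j, d in enumerate(delays):
        Y[d:, j] = u[:-d]
    return Y
```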

[0066] The network state was randomly initialized. The input was then presented to the network for 700 update steps. Data from the first 200 update steps were discarded to get rid of initial transient effects. Data from the remaining 500 update steps were collected and used with the off-line embodiment of the learning method of the invention. The result was a set of weights for the connections from DR and input units to output units. The network run was continued with the learnt weights for another 150 update steps. The input and outputs of the last 50 update steps are plotted in FIG. 3(b). The three plots show the correct delayed signal (solid) superimposed on the outputs generated by the learnt network (dashed). It becomes apparent that the network has successfully learnt to delay a signal even for as long as 40 time steps.

[0067] In order to quantify the precision of the learnt network output, the mean square error of each of the three output units was calculated from a sample sequence. They were found to be 0.0012, 0.0013, 0.0027 for the delays of 10, 20, 40, respectively.

[0068] Comment. The challenge of this learning task is that the network has to serve as a temporal memory. This goal is served by two aspects of the setup of the network for learning. First, the autonomous dynamics of the DR was tuned such that it was globally stable only by a small margin. The effect is that dynamic aftereffects of input die out slowly, which enhances the temporal memory depth. Second, the input-to-DR connections had very small weights. The effect was that the ongoing (memory-serving) activation within the DR net is only weakly modulated, such that memory-relevant "repercussions" are not too greatly disturbed by incoming input.

Example 3

[0069] Learning an Excitable Medium

[0070] In this example it is demonstrated how the method of the invention can be used to train a 2-dimensional network to support the dynamics of an excitable medium.

[0071] The network was set up as in FIGS. 4a,b. It consisted of two layers of 100 units, which were each arranged in a 10×10 grid. To avoid dealing with boundary conditions, the grid was topologically closed into a torus. The first layer was used as the DR, the second layer was the output layer.

[0072] A local connectivity pattern was provided, as follows. Each unit of the first layer received connections from locally surrounding units within that layer (FIG. 4a). The weights were set depending on the distance r1 between units, as shown in FIG. 4c. The resulting internal DR dynamics is depicted in FIG. 4d, which shows the response of 8 arbitrarily selected units of the first layer to a unit impulse fed into the first unit at timestep 10. It can be seen that the DR dynamics dies out, i.e., it is globally stable.

[0073] Each unit of the first layer additionally received connections from output units that lay in a local neighborhood of radius r2. The dependency of the weights on the distance r2 is shown in FIG. 4e.

[0074] Among all possible connections from the DR to a particular output unit, only those for which the grid distance r3 was less than or equal to 4 (FIG. 4b) had to be trained. The goal of learning was to obtain weights for these DR-to-output connections.

[0075] No input was involved in this learning task.

[0076] The teaching signal consisted in a "soliton" wave which was teacher-forced on the output layer. The soliton slowly wandered with constant speed and direction across the torus. FIG. 4f shows four successive time steps of the teacher signal. Note the effects of the torus topology in the first snapshot.

[0077] The teaching proceeded as follows. The DR network state was initialized to all zeros. The network was then run for 60 time steps. The DR units were updated according to Equation (1), with a sigmoid transfer function f=tanh. The output units were updated by teacher forcing, i.e., the teacher signal shown in FIG. 4f was written into the output units. Data from the first 30 time steps were discarded, and the data collected from the remaining 30 time steps were used for the off-line embodiment of the learning method of the invention. The result was a set of weights for the connections from DR units to output units. A speciality of this learning task is that the result of the teaching should be spatially homogeneous, i.e., all output units should be equipped with the same set of weights. This allowed the data obtained from all 100 output units to be pooled for the learning method of the invention, i.e. a training sample of effectively 100×30=3000 pairings of network states and desired outputs was used to calculate the desired weight set.

[0078] To get an impression of what the network has learnt, several demonstration runs were performed with the trained network.

[0079] In the first demonstration, the network was teacher-forced with the soliton teacher for an initial period of 10 time steps. Then the teacher forcing was switched off and the network was left running freely for 100 further steps. FIG. 4g shows snapshots taken at time steps 1, 5, 10, 20, 50, 100 from this free run. The initially forced soliton persists for some time, but then the overall dynamics reorganizes into a stable, symmetric pattern of two larger solitons that wander across the torus with the same speed and direction as the training soliton.

[0080] In other demonstrations, the network was run from randomized initial states without initial teacher forcing. After some time (typically less than 50 time steps), globally organized, stable patterns of travelling waves emerged. FIG. 4h shows a smooth and a rippled wave pattern that emerged in this way.

[0081] Comment. This example highlights how the method of the invention applies to spatial dynamics. The learning task actually is restricted to a single output unit; the learnt weights are copied to all other output units due to the spatial homogeneity condition that was imposed on the system in this example. The role of the DR is taken by the hidden layer, whose weights in this case were not given randomly (as in the previous examples) but were designed according to FIG. 4c.

Example 4

[0082] Learning a Chaotic Oscillator: the Lorenz Attractor

[0083] In this example it is shown how the method of the invention can be used for the online-learning of a chaotic oscillator, in the presence of noise in the teaching signal.

[0084] The network was set up with a randomly and sparsely connected DR (80 units, connectivity 0.1, weights +0.4 or −0.4 with equal probability) and a single output unit (output-to-DR feedback connections with full connectivity, random weights drawn from a uniform distribution over [−2, 2]). The update rule was a "leaky integration" variant of Eq. (1), which uses a "potential" variable v to mix earlier states with the current state:

x_(j)(t+1)=f(v_(j)(t+1))

v_(j)(t+1)=(1−a_(j))(Σ_(i=1, . . . ,K) w_(ji)x_(i)(t))+a_(j)v_(j)(t)   (6)

[0085] A transfer function f=tanh was used. The leaking coefficients a_(j) were chosen randomly from a uniform distribution over [0, 0.2].
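
A minimal sketch of the leaky-integration update (6), with assumed names (a: vector of leaking coefficients a_j; v: the potential variable):

```python
import numpy as np

def leaky_step(W, x, v, a, f=np.tanh):
    """One update per Eq. (6); returns the new activations and potentials."""
    v_new = (1.0 - a) * (W @ x) + a * v
    return f(v_new), v_new
```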

[0086] As in the previous examples, this setup resulted in an RNN with marginal global stability and a rich variety in the impulse responses of the individual units.

[0087] The 1-dimensional teaching signal was obtained by projecting the well-known 3-dimensional Lorenz attractor on its first dimension. A small amount of noise was added to the signal. A delay-embedding representation of the noisy teacher signal is shown in FIG. 5a, and of the teacher signal without noise in FIG. 5b. The learning task was to adapt the DR-to-output weights (using the noisy training signal) such that the neural network reproduced the (noise-free) Lorenz trace in its output unit dynamics.

[0088] The output weights were trained according to the method of the invention. For demonstration purposes, three variants are reported here: (a) offline learning, (b) online learning with the RLS method, (c) online learning with the LMS method.

[0089] Offline learning. The network state was initialized to all zero. The network was then run for 5100 update steps, with the correct teacher output written into the output unit (teacher forcing). Data from the first 100 update steps were discarded. Data from the remaining 5000 update steps with teacher forcing were collected and used to determine DR-to-output weights with minimal MSE (Eq. (4)) by a linear regression computation. The MSE (4) incurred was 0.000089 (the theoretically possible minimum mean square error, stemming from the noise component in the signal, would be 0.000052). A time series generated by the trained network is shown in FIG. 5c.

[0090] Online learning with the RLS method. The "recursive least squares" method can be implemented in many variants. Here, the version from the textbook B. Farhang-Boroujeny, Adaptive Filters: Theory and Applications, Wiley & Sons 1999, p. 423 was used. The same DR was used as in the offline learning version. The "forgetting rate" required by RLS was set to λ=0.9995. FIG. 5d shows the learning curve (development of log₁₀(ε²), low-pass filtered by averaging over 100 steps per plot point). The error converges to the final misadjustment level of approximately 0.000095 after about 1000 steps, which is slightly worse than in the offline trial. FIG. 5e shows a time series generated by the trained network.

[0091] Online learning with the LMS method. The least mean squares method is very popular due to its robustness and simplicity. However, as was already mentioned in the "Summary of the Invention", it is not ideal in connection with the method of the invention. The reason is that DR state vectors have large eigenvalue spreads. Nevertheless, for illustration of this fact, the LMS method was carried out. The LMS method updates weights at every time step according to:

w _(ji)(t+1)=w _(ji)(t)+μεx _(i)(t),   (7)

[0092] where μ is a learning rate, j is the index of the output unit, and ε=f⁻¹(ỹ_(j)(t))−f⁻¹(y_(j)(t)) is the output unit state error, i.e. the difference between the (f-inverted) teacher signal ỹ_(j)(t) and the (f-inverted) output unit signal y_(j)(t).
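
A sketch of one LMS step (7), assuming f=tanh for the output unit so that f⁻¹(y_j(t)) equals the inner product <w_j,x(t)>:

```python
import numpy as np

def lms_step(w, x, y_teach, mu, f_inv=np.arctanh):
    """One LMS update of the output weight vector w, per Eq. (7)."""
    eps = f_inv(y_teach) - w @ x   # f-inverted teacher minus f-inverted output
    return w + mu * eps * x
```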

[0093] The network was adapted in five successive epochs with decreasing learning rates μ: 1. μ=0.03, N=1000 steps; 2. μ=0.01, N=10,000; 3. μ=0.003, N=50,000; 4. μ=0.001, N=100,000; 5. μ=0.0003, N=200,000. At the end of the fifth epoch, a mean square error E[ε²]≈0.000125 was reached. FIG. 5f shows the learning curve (all epochs joined), and FIG. 5g shows a time series generated by the trained network. It is apparent that the trained network produces a point attractor instead of a chaotic attractor. This highlights the fact that the LMS method is ill-suited for training DR-to-output weights. A closer inspection of the eigenvalue distribution of the covariance matrix of state vectors x(t) of the trained network reveals that the eigenvalue spread is very high indeed: λ_(max)/λ_(min)≈3×10⁸. FIG. 5h gives a log plot of the eigenvalues of this matrix. Eigenvalue distributions like this are commonly found in DRs which are prepared as sparsely connected, randomly weighted RNNs.

Example 5

[0094] A Direct/State Feedback Controller

[0095] In this example it is shown how the method of the invention can be used to obtain a state feedback neurocontroller for tracking control of a damped pendulum.

[0096] The pendulum was simulated in discrete time by the difference equation

ω(t+δ)=ω(t)+δ(−k₁ω(t)−k₂ sin(φ(t))+u(t)+v(t))

φ(t+δ)=φ(t)+δω(t)   (8)

[0097] where ω is the angular velocity, φ is the angle, δ is the timestep increment, u(t) is the control input (torque), and v(t) is uncontrolled noise input. The constants were set to k₁=0.5, k₂=1.0, δ=0.1, and the noise input was taken from a uniform distribution in [−0.02, 0.02].
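
For illustration, the pendulum simulation (8) with the stated constants can be sketched as:

```python
import numpy as np

K1, K2, DELTA = 0.5, 1.0, 0.1
rng = np.random.default_rng()

def pendulum_step(omega, phi, u):
    """One Euler step of Eq. (8) with uniform noise v(t) in [-0.02, 0.02]."""
    v = rng.uniform(-0.02, 0.02)
    omega_next = omega + DELTA * (-K1 * omega - K2 * np.sin(phi) + u + v)
    phi_next = phi + DELTA * omega
    return omega_next, phi_next
```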

[0098] The task was to train a tracking controller for the pendulum. More specifically, the trained controller network receives a two-steps-ahead reference trajectory y_(ref)(t+2δ)=(x_(1ref)(t+2δ),x_(2ref)(t+2δ),ω_(ref)(t+2δ)), where x_(1ref)(t+2δ),x_(2ref)(t+2δ) are the desired position coordinates of the pendulum endpoint and ω_(ref)(t+2δ) is the desired angular velocity. The length of the pendulum was 0.5, so x_(1ref)(t+2δ),x_(2ref)(t+2δ) range in [−0.5,0.5]. Furthermore, the controller receives state feedback y(t)=(x₁(t),x₂(t),ω(t)) of the current pendulum state. The controller has to generate a torque control input u(t) to the pendulum such that two update steps after the current time t the pendulum tracks the reference trajectory. FIG. 6a shows the setup of the controller in the exploitation phase.

[0099] For training, a 500-step long teacher signal was prepared by simulating the pendulum's response to a time-varying control input ũ(t), which was chosen as a superposition of two random banded signals, one with high frequencies and small amplitude, the other with low frequencies and high amplitude. FIG. 6c shows the control input ũ(t) used for the training signal, FIG. 6d shows the simulated pendulum's state answer x₂(t) and FIG. 6e the state answer ω(t) (the state answer component x₁(t) looks qualitatively like x₂(t) and is not shown). The training signal for the network consisted of inputs y(t)=(x₁(t),x₂(t),ω(t)) and y(t+2δ)=(x₁(t+2δ),x₂(t+2δ),ω(t+2δ)); from these inputs, the network had to learn to generate as its output u(t). FIG. 6b shows the training setup.

[0100] The network was set up with the same 100-unit DR as in the previous (Lorenz attractor) example. 6 external input units were sparsely (connectivity 20%) and randomly (weights +0.5, −0.5 with equal probability) attached to the DR, and one output unit was provided without feedback connections back to the DR. The network update rule was the standard noisy sigmoid update rule (1′) for the internal DR units (noise homogeneously distributed in [−0.01, +0.01]). The output unit was updated with a version of Eq. (1) where the transfer function was the identity (i.e., a linear unit). The DR-to-output weights were computed by a simple linear regression such that the error ε(t)=ũ(t)−u(t) was minimized in the mean square sense over the training data set (N=500), as indicated in FIG. 6b.

[0101] In a test, the trained network was presented with a target trajectory y_(ref)(t+2δ)=(x_(1ref)(t+2δ),x_(2ref)(t+2δ),ω_(ref)(t+2δ)) at the 3 units which in the training phase received the input y(t+2δ)=(x₁(t+2δ),x₂(t+2δ),ω(t+2δ)). The network further received state feedback y(t)=(x₁(t),x₂(t),ω(t)) from the pendulum at the 3 units which received the signals y(t)=(x₁(t),x₂(t),ω(t)) during training. The network generated a control signal u(t) which was fed into the simulated pendulum. FIG. 6f shows the network output u(t); FIG. 6g shows a superposition of the reference x_(2ref)(t+2δ) (solid line) with the 2-step-delayed pendulum trajectory x₂(t+2δ) (dashed line); FIG. 6h shows a superposition of the reference ω_(ref)(t+2δ) (solid line) with the 2-step-delayed pendulum trajectory ω(t+2δ) (dashed line). The network has learnt to function as a tracking controller.

[0102] Discussion. The trained network operates as a dynamical state feedback tracking controller. Analytic design of perfect tracking controllers for the pendulum is not difficult if the system model (8) is known. The challenge in this example is to learn such a controller without a priori information from a small training data set.

[0103] The approach to obtain such a controller through training of a recurrent neural network is novel and represents a dependent claim of the invention. More specifically, the claim is a method to obtain closed-loop tracking controllers by training of a recurrent neural network according to the method of the invention, where (1) the input training data consists of two vector-valued time series of the form y(t+Δ),y(t), where y(t+Δ) is a future version of the variables that will serve as a reference signal in the exploitation phase, and y(t) are state or observation feedback variables (not necessarily the same as in y(t+Δ)), and (2) the output training data consists in a vector ũ(t), which is the control input presented to the plant in order to generate the training input data y(t+Δ),y(t).

Example 6

[0104] A Two-Way Device: Frequency Generator+Frequency Meter

[0105] In this example it is shown how the method of the invention can be used to obtain a device which can be used in two ways: as a tunable frequency generator (input: frequency target, output: oscillation of desired frequency) and as a frequency meter (input: oscillation, output: frequency indication). The network has two extra units, each of which can be used either as an input or as an output unit. During training, both units are treated formally as output units, in the sense that two teacher signals are presented simultaneously: the target frequency and an oscillation of that frequency.

[0106] In the training phase, the first training channel is a slowly changing signal that varies smoothly but irregularly between 0.1 and 0.3 (FIG. 7a). The other training channel is a fast sine oscillation whose frequency varies according to the first signal (FIG. 7b; the apparent amplitude jitter is a discrete-sampling artifact).

[0107] The network was set up with a DR of 100 units. The connection weight matrix W was a band matrix with a width-5 diagonal band (i.e., w_(ji)=0 if |j−i|≧3). This band structure induces a topology on the units: the nearer two units (i.e., the smaller |j−i| mod 100), the more direct their coupling. This locality allows locally different activation patterns to emerge. FIG. 7c shows the impulse responses of every 5th unit (impulse input at timestep 10). The weights within the diagonal band were preliminarily set to +1 or −1 with equal probability. The weights were then globally and uniformly scaled until the resulting DR dynamics was marginally globally stable. This scaling resulted in weights of ±0.3304 with a stability margin of δ=0.0025 (stability margins are defined in the detailed description of preferred embodiments later in this document).
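
A sketch of constructing such a band matrix follows; the ring closure via mod 100 and the final scale of 0.3304 are taken from the description above, while the function and parameter names are illustrative.

```python
import numpy as np

def band_matrix(K=100, half_width=2, scale=0.3304, rng=None):
    """Width-5 diagonal band closed into a ring, entries +/-1, then scaled."""
    rng = rng or np.random.default_rng()
    W = np.zeros((K, K))
    for j in range(K):
        for d in range(-half_width, half_width + 1):
            W[j, (j + d) % K] = rng.choice([-1.0, 1.0])
    return scale * W
```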

[0108] Additionally, the two extra units were equipped with feedback connections which projected back into the DR. These connections were established randomly with a connectivity of 0.5 for each of the two extra units. The weights of these feedback connections were chosen randomly to be ±1.24 for the first extra unit and ±6.20 for the second extra unit.

[0109] The network state was randomly initialized, and the network was run for 1100 steps for training. Two signals of the same kind as shown in FIGS. 7a,b were presented to the network (the target frequency signal to the first extra unit and the oscillation to the second), and the correct teacher output was written into the two output nodes (teacher forcing). The update of DR units was done with a small additive noise according to Eq. (1′). The noise was sampled from a uniform distribution over [−0.02, 0.02]. Data from the first 100 update steps were discarded. Data from the remaining 1000 update steps with teacher forcing were collected and used to obtain a linear regression solution minimizing the mean square error of Eq. (4). The result was a set of weights for the connections from DR units to the two output units.

[0110] In the exploitation phase, the trained RNN was used in either of two ways, as a frequency generator or as a frequency meter. In the exploitation phase, the no-noise version Eq. (1) of the update rule was used.

[0111] In the frequency generator mode of exploitation, the first extra unit was treated as an input unit, and the second as an output unit. A target frequency signal was fed into the input unit, for instance the 400-timestep staircase signal shown in FIG. 7d. At the second extra unit, here assigned to the output role, an oscillation was generated by the network. FIG. 7e shows an overlay of an oscillation of the correct frequency demanded by the staircase input (solid line) with the output actually generated by the network (dashed line). FIG. 7f shows an overlay of the frequency amplitudes (absolutes of Fourier transforms) of the correct output signal (solid line) and the network-generated output (dashed line). It appears from FIGS. 7e,f that the network has learnt to generate oscillations of the required frequencies, albeit with frequency distortions at the low and high ends of the range. FIG. 7g shows traces of 8 arbitrarily selected units of the DR. They exhibit oscillations of the same frequency as the output signal, transposed and scaled in their amplitude range according to the input signal.

[0112] In the frequency meter mode of exploitation, the second extra unit was used as an input unit into which oscillations of varying frequency are written. The first extra unit now served as the output unit. FIG. 7h shows an input signal. FIG. 7i presents an overlay of the perfect output (solid line) with the actually generated output (dashed line). The network has apparently learnt to serve as a frequency meter, although again with some distortion at the low and high ends of the range. A trace plot of DR units would look exactly like that in the frequency generator mode and is omitted.

[0113] The challenge in this example is twofold. First, the network had to learn not an output dynamics per se, but rather "discover" the dynamical relationship between the two training signals. Second, the time scales of the two signals are very different: the frequency target is essentially stationary, while the oscillation signal changes on a fast timescale. A bidirectional information exchange between signals of different timescales, which was requested from the trained network, presents a particular difficulty. Using a noisy update rule during learning was found to be indispensable in this example to obtain stable dynamics in the trained network.

[0114] This example is an instance of another dependent claim of the invention, namely, to use the method of the invention to train an RNN on the dynamic relationship between several signals. More specifically, the claim is (1) to present training data ỹ₁(t), . . . , ỹ_(n)(t) to n extra units of a DR architecture according to the invention, where these extra units have feedback connections to the DR, (2) to train the network such that the mean square error from Eq. (4) is minimized, and then (3) to exploit the network in any "direction" by arbitrarily declaring some of the units as input units and the remaining ones as output units.

[0115] Discussion of Examples

[0116] The examples highlight what the invariant, independent core of the invention is, and what are dependent variants that yield alternative embodiments.

[0117] Common aspects in the examples are:

[0118] use of a DR, characterized by the following properties:

[0119] its weights are not changed during learning

[0120] its weights are globally scaled such that a marginally globally stable dynamics results

[0121] the DR is designed with the aim that the impulse responses of different units be different

[0122] the number of units is greater than would strictly be required for a minimal-size RNN for the respective task at hand (overcomplete basis aspect)

[0123] training only the DR-to-output connection weights such that the mean square error from Eq. (4) is minimized over the training data.

[0124] The examples exhibit differences in the following aspects:

[0125] The network may have a topological/spatial structure (2-dimensional grid in the excitable medium example, band-matrix-induced locality in the two-way device example) or may not have such structuring (other examples).

[0126] The required different impulse responses of DR units can be achieved by explicit design of the DR (excitable medium example) or by random initialization (other examples).

[0127] The update law of the network can be the standard method of equation (1) (short term memory, excitable medium examples) or other (leaky integration update rule in the chaotic oscillator, noisy update in the two-way device).

[0128] The computation of the DR-to-output connection weights can be done offline (short term memory, excitable medium, two-way device) or on-line (chaotic oscillator), using any standard method for mean square error minimization.

DETAILED DESCRIPTION OF THE INVENTION AND PREFERRED EMBODIMENTS

[0129] Preferred embodiments of the invention are now described in detail. Like in the Summary of the Invention, the detailed description is organized by presenting first the architectural and setup aspects, and then the procedural aspects of the learning method.

[0130] Setup of the DR

[0131] A central architectural aspect of the invention is the provision of the DR, whose weights are fixed and are not changed by subsequent learning. The purpose of the DR for the learning method of this invention is to provide a rich, stable, preferably long-lasting excitable dynamics. The invention provides the following methods to realize this goal.

[0132] Rich Dynamics through Large Network Size

[0133] Preferred embodiments of the invention have relatively large DRs to provide for a rich variety of different unit dynamics. 50 units and (many) more would be typical cases; less than 50 units would be suitable only for undemanding applications like learning simple oscillators.

[0134] Rich Dynamics through Inhomogeneous Network Structure

[0135] Preferred embodiments of the invention achieve a rich variety in the impulse responses of the DR units by introducing inhomogeneity into the DR. The following strategies, which can be used singly or in combination, contribute to the design goal of inhomogeneity:

[0136] realize an inhomogeneous connectivity structure in the DR,

[0137] by constructing the DR connectivity randomly and sparsely,

[0138] by using a band-structured connectivity matrix, which leads to spatial decoupling of different parts of the DR (as in the two-way device example),

[0139] by imposing some other internal structuring on the DR topology, e.g. by arranging its units in layers or modules,

[0140] equip DR units with different response characteristics, by giving them

[0141] different transfer functions,

[0142] different time constants,

[0143] different connection weights.

[0144] Marginally Stable Dynamics through Scaling

[0145] A preferred method to obtain a DR with a globally stable dynamics is to first construct an inhomogeneous DR according to the previously mentioned preferred embodiments, and then globally scale its weights by a common factor α which is selected such that

[0146] 1. the network dynamics is globally stable, i.e. from any starting activation the dynamics decays to zero, and

[0147] 2. this stability is only marginal, i.e. the network dynamics becomes unstable if the network weights are further scaled by a factor α′=1+δ, which is greater than unity by a small margin.

[0148] When δ in the scaling factor α′=1+δ is varied, the network dynamics undergoes a bifurcation from globally stable to some other dynamics at a critical value δ_(crit). This value was called the stability margin in the examples above. The only method currently available to determine the stability margin of a given scaling factor is by systematic search.
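
A sketch of such a systematic search is given below; the decay test (free runs from random states, checking that activations fall below a tolerance) and the shrink factor are assumptions, not prescriptions of the method.

```python
import numpy as np

def decays_to_zero(W, steps=1000, trials=5, f=np.tanh, tol=1e-4):
    """Empirical test for global stability: free runs must die out."""
    for _ in range(trials):
        x = np.random.uniform(-1.0, 1.0, W.shape[0])
        for _ in range(steps):
            x = f(W @ x)
        if np.max(np.abs(x)) > tol:
            return False
    return True

def scale_to_marginal_stability(W0, shrink=0.98):
    """Shrink the global scaling factor until the dynamics is globally stable."""
    alpha = 1.0
    while not decays_to_zero(alpha * W0):
        alpha *= shrink
    return alpha * W0
```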

[0149] Tuning Duration of Short-Term Memory through Tuning Marginality of Stability

[0150] In many applications of RNNs, a design goal is to achieve a long short-term memory in the learnt RNN. This design goal can be supported in embodiments of the invention by a proper selection of the stability margin of the DR.

[0151] The smaller the stability margin, the longer the effective short-term memory duration. Therefore, the design goal of long-lasting short-term memory capabilities can be served in embodiments of the invention by setting the stability margin to small values. In typical embodiments, where maximization of short-term memory duration is a goal, values of δ smaller than 0.1 are used.

[0152] Presenting Input to the DR

[0153] In the field of artificial neural networks, by far the most common way to present input to networks is by means of extra input units. This standard method has been used in the above examples. Alternative methods to feed input into an RNN are conceivable, but either are essentially notational variants of extra input units (e.g., adding input terms into the DR unit activation update equation Eq. (1)) or are very rarely used (e.g., modulating global network parameters by input). Any method is compatible with the method of the invention, as long as the resulting dynamics of the DR is (1) significantly affected by the input, and (2) the required variability of the individual DR units' dynamics is preserved.

[0154] The most common way of presenting input (by extra input units) is now described in more detail.

[0155] According to the method of the invention, the connectivity pattern from input units to the DR network, and the weights on these input-DR-connections, are fixed at construction time and are not modified during learning.

[0156] In preferred embodiments of the invention, the input-DR-connections and their weights are fixed in two steps. In step 1, the connectivity pattern is determined and the weights are put to initial values. In step 2, the weight values are globally scaled to maximize performance. These two steps are now described in more detail.

[0157] Step 1: Establish input-to-DR connections and put their weights to initial values. The design goal to be achieved in step 1 is to ensure a high variability in the individual DR units' responses to input signals. This goal is reached, according to the method of the invention, by observing the following rules, which can be used in any combination:

[0158] Provide connections sparsely, i.e., put zero weights to many or most of the possible connections from an input unit to DR units.

[0159] Select the weights of non-zero connections randomly by sampling from a probability distribution (as in the chaotic oscillator learning example).

[0160] Assign different signs to the weights of non-zero connections, i.e. provide both inhibitory and excitatory connections.

[0161] Step 2: Scale the weights set in step 1 globally. The goal of step 2 is to optimize performance. No general rule can be given. According to the specific purpose of the network, different scaling ranges can be optimal, from very small to very large absolute weights. It will be helpful to observe the following rules, which are given here for the convenience of the user. They are applicable in embodiments where the update rule of the DR network employs nonlinear (typically, sigmoid) transfer functions.

[0162] Large weights are preferred for fast, high-frequency I/O response characteristics, small weights for slow signals or when some lowpass characteristics are desired. For instance, in training a multistable (multiflop) memory network (not described in this document), where the entire network state had to switch from one attractor to another through a single input impulse, quite large input-to-DR weights with values of ±5.0 were used.

[0163] Large weights are preferred when highly nonlinear, "switching" I/O dynamics are desired; small weights are preferred for more linear I/O dynamics.

[0164] Large weights are preferred for tasks with low temporal memory length requirements (i.e., output at time t depends significantly only on a few preceding inputs and outputs), small weights for long temporal memory effects. For instance, in the delay line example (where a large memory length was aimed for), very small input-to-DR weights of ±0.001 were used.

[0165] If there are many input channels, channels whose input-DR-connections have greater absolute weights are emphasized in their influence on the system output compared to low-weight channels.
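
As a hedged illustration of the two-step setup just described, the following Python sketch constructs sparse, random, mixed-sign input-to-DR weights (step 1) and applies a global scaling factor (step 2). The function name and parameters are assumptions; the scale values 5.0 and 0.001 echo the multiflop and delay line examples mentioned above.

    import numpy as np

    def make_input_weights(K, n, density=0.1, scale=1.0, seed=0):
        """Step 1: sparse random mixed-sign input-to-DR weights.
        Step 2: global scaling by a task-dependent factor (hypothetical API)."""
        rng = np.random.default_rng(seed)
        mask = rng.random((K, n)) < density            # sparse connectivity
        W_in = rng.uniform(-1.0, 1.0, (K, n)) * mask   # both signs, random
        return scale * W_in                            # global scaling

    W_in_switching = make_input_weights(100, 2, scale=5.0)    # fast, nonlinear I/O
    W_in_long_mem  = make_input_weights(100, 2, scale=0.001)  # long memory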

[0166] Reading Output from the Network in the Exploitation Phase

[0167] According to the method of the invention, output is always read from the output units. During the exploitation phase, the j-th output y_(j)(t+1) (j=1, . . . ,m) is obtained from the j-th output unit by an application of the update rule Eq. (1), i.e., by y_(j)(t+1)=f_(j)(<w_(j),x(t)>), where the inner product <w_(j),x(t)> denotes the sum of weighted activations of input units u(t), DR units x(t), and output units y(t):

w_(j1)u₁(t)+ . . . +w_(jn)u_(n)(t)+w_(j,n+1)x₁(t)+ . . . +w_(j,n+K)x_(K)(t)+w_(j,n+K+1)y₁(t)+ . . . +w_(j,n+K+m)y_(m)(t),

[0168] passed through the transfer function f_(j) of the j-th output unit. In typical embodiments, f_(j) is a sigmoid or a linear function.

[0169] Feedback Connections from Output Units to the DR

[0170] Depending on the desired task, the method of the invention provides two alternatives concerning feedback from the output units to the DR: (a) the network can be set up without such connections, or (b) the network can be equipped with such connections. Embodiments of the invention of type (a) will typically be employed for passive filtering tasks, while case (b) typically is required for active signal generation tasks. However, feedback connections can also be required in filtering tasks, especially when the filtering task involves modeling a system with an autonomous state dynamics (as in the two-way device example). This situation is analogous, in linear signal processing terminology, to infinite impulse response (IIR) filters. However, that terminology is commonly reserved for linear filters, whereas RNNs yield nonlinear filters. Therefore, in this patent application another terminology shall be used: RNNs which have input and feedback connections from the output units will be referred to as serving active filtering tasks.

[0171] According to the method of the invention, when feedback connections are used (i.e., in signal generation or active filtering tasks), they are fixed at the design time of the network and not changed in the subsequent learning.

[0172] The setup of output-to-DR feedback connections is completely analogous to the setup of input-to-DR connections, which was described in detail above. Therefore, it suffices here to repeat that in a preferred embodiment of the invention, the output-to-DR feedback connections are designed in two steps. In the first step, the connectivity pattern and an initial set of weights are fixed, while in the second step the weights are globally scaled. The design goals and heuristic rules described for input-to-DR connections apply to output-to-DR connections without change, and need not be repeated.

[0173] Optimizing the Output MSE by Training the DR-to-Output Weights

[0174] After the network has been set up by providing a DR network and suitable input and output facilities, as related above, the method of the invention proceeds to determine the weights from DR units (and also possibly from input units, if they are provided) to the output units. This is done through a supervised training process.

[0175] Training Criterion: Minimizing Mean Square Output Error

[0176] The weights of connections to output units are determined such that the mean square error Eq. (4) is minimized over the training data. Equation (4) is here repeated for convenience: $\begin{matrix}{{E\left\lbrack \varepsilon_{j}^{2} \right\rbrack} = {\frac{1}{N - 1}\sum\limits_{t = 1}^{N - 1}\left( {f_{j}^{- 1}\left( {\overset{\sim}{y}}_{j}\left( {t + 1} \right) \right)} - {\left\langle w_{j},{x(t)} \right\rangle} \right)^{2}.}} & (4)\end{matrix}$

[0177] In (4), {tilde over (y)}_(j)(t) is the desired (teacher) output of the j-th output unit, to which the inverse of the transfer function f_(j) of this unit is applied. The term <w_(j),x(t)> denotes the inner product

w_(j1)u₁(t)+ . . . +w_(jn)u_(n)(t)+w_(j,n+1)x₁(t)+ . . . +w_(j,n+K)x_(K)(t)+w_(j,n+K+1)y₁(t)+ . . . +w_(j,n+K+m)y_(m)(t),   (5) [repeated]

[0178] where u_(i)(t) are activations of input units (if applicable), x_(i)(t) of DR units, and y_(i)(t) of output units.

[0179] In alternative embodiments of the invention which employ online adaptive methods, instead of minimizing the MSE Eq. (4), it is also possible to minimize the following mean square error: $\begin{matrix}{{E\left\lbrack \varepsilon_{j}^{2} \right\rbrack} = {\frac{1}{N - 1}\sum\limits_{t = 1}^{N - 1}\left( {{\overset{\sim}{y}}_{j}\left( {t + 1} \right)} - {f_{j}\left( \left\langle w_{j},{x(t)} \right\rangle \right)} \right)^{2}.}} & \left( 4^{\prime} \right)\end{matrix}$

[0180] The theoretical difference between the two variants is that in the first case (Eq. (4)), the learning procedure will minimize the output unit state error, while in the second case, the output value error is minimized. In practice this typically does not make a significant difference, because output unit state and output value are directly connected by the transfer function. In the examples described in the examples section, version (4) was used throughout.

[0181] In yet alternative embodiments of the invention, the MSE to be minimized refers only to a subset of the input, DR, and output units. More precisely, in these alternative embodiments, the MSE $\begin{matrix}{{E\left\lbrack \varepsilon_{j}^{2} \right\rbrack} = {\frac{1}{N - 1}\sum\limits_{t = 1}^{N - 1}\left( {f_{j}^{- 1}\left( {\overset{\sim}{y}}_{j}\left( {t + 1} \right) \right)} - {\left\langle {s \cdot w_{j}},{x(t)} \right\rangle} \right)^{2}}\quad{or}} & \left( 4^{*} \right) \\ {{E\left\lbrack \varepsilon_{j}^{2} \right\rbrack} = {\frac{1}{N - 1}\sum\limits_{t = 1}^{N - 1}\left( {{\overset{\sim}{y}}_{j}\left( {t + 1} \right)} - {f_{j}\left( \left\langle {s \cdot w_{j}},{x(t)} \right\rangle \right)} \right)^{2}}} & \left( 4^{\prime *} \right)\end{matrix}$

[0182] is minimized, where s is a vector of the same length as w_(j), consisting of 0's and 1's, and r·s=(r₁, . . . ,r_(k))·(s₁, . . . ,s_(k))=(r₁s₁, . . . ,r_(k)s_(k)) denotes elementwise multiplication. The effect of taking <s·w_(j),x(t)> instead of <w_(j),x(t)> is that only the input/DR/output units selected by the selection vector s are used for minimizing the output error. The connection weights from those input/DR/output units which are marked by 0's in s to the output units are put to zero. Specifically, variants (4*) or (4′*) can be used to preclude the learning of output-to-output connections. Variant (4*) was used in the examples “short-time memory” and “feedback controller” (precluding output-to-output feedback), and in the “excitable medium” example (extensive use of (4*) for defining the local neighborhoods shown in FIGS. 4a,b).
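
A one-line numpy rendering may clarify the selection mechanism of Eq. (4*); the concrete vectors here are invented purely for illustration.

    import numpy as np

    # Hypothetical illustration of the selection vector s in Eq. (4*):
    # zero entries exclude the corresponding input/DR/output units, so the
    # learned weights from those units to the output unit are forced to zero.
    s   = np.array([1.0, 1.0, 0.0, 1.0, 0.0])    # select units 1, 2 and 4
    w_j = np.array([0.2, -0.4, 0.7, 0.1, -0.9])  # weights to output unit j
    x_t = np.array([0.5, 0.1, -0.3, 0.8, 0.2])   # total network state x(t)
    inner = np.dot(s * w_j, x_t)                 # <s·w_j, x(t)>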

[0183] Training Method: Supervised Teaching with Teacher Forcing

[0184] According to the method of the invention, the MSE (4), (4′), (4*) or (4′*) is minimized through a procedure of supervised teaching. A training sequence consisting of an input time series u(t) and a (desired) output time series {tilde over (y)}(t) must be available, where t=1,2, . . . ,N. The input sequence u(t) may be absent when the learning task is to learn a purely generative dynamics, as in the Lorenz attractor and the excitable medium examples.

[0185] According to the method of the invention, the activations of the DR are initialized at time t=1. Preferably, the DR activations are initialized to zero or to small random values.

[0186] The method of the invention can be used for constructive offline learning and for adaptive online learning. The method of the invention can be adjusted to these two cases, as detailed below. However, several aspects of the invention are independent from the online/offline distinction.

[0187] According to one aspect which is independent from the online/offline distinction, the input training sequence u(t) is fed into the DR for t=1,2, . . . ,N.

[0188] According to another aspect of the invention which is independent from the online/offline distinction, the output training sequence {tilde over (y)}(t)=({tilde over (y)}₁(t), . . . ,{tilde over (y)}_(m)(t)) is written into the m output units, i.e., the activation y_(j)(t) of the j-th output unit (j=1, . . . ,m) at time t is set to {tilde over (y)}_(j)(t). This is known in the RNN field as teacher forcing. Teacher forcing is essential in cases where there are feedback connections from the output units to the DR. In cases where such feedback connections are not used, teacher forcing is inconsequential but assumed nonetheless for the convenience of a unified description of the method.

[0189] According to another procedural aspect of the invention which is independent from the online/offline distinction, the DR units are updated for time steps t=1,2, . . . ,N. The particular update law is irrelevant for the method of the invention. The repeated update of the DR generates an activation vector sequence x(1), . . . ,x(N), where x(t) is a vector containing the activations of the network's units (including input units but excluding output units) at time t.

[0190] In preferred embodiments of the invention, a small amount of noise is added to the network dynamics during the training phase. One method to add noise is to use update equation (1′), i.e., add a noise term to each network state at each update time. An alternative method to introduce noise is to add noise to the input signals u(t) and/or {tilde over (y)}(t). More specifically, instead of writing u(t) into the input units, write u(t)+v(t) into them; and instead of teacher-forcing {tilde over (y)}(t) into the output units, write {tilde over (y)}(t)+v(t) into the output units (v(t) is a noise term). Note however that when a noisy signal {tilde over (y)}(t)+v(t) is used for teacher forcing, the to-be-minimized MSE still refers to the non-noisified version of the training output, i.e., to the chosen variant of Eq. (4).
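
The second noise-insertion variant (perturbing the teacher signal before it is written into the output units) can be sketched as follows. The amplitude and names are assumptions of this sketch; as stated above, the MSE is still computed against the clean teacher signal.

    import numpy as np

    rng = np.random.default_rng(1)

    def noisy_teacher(y_teach_t, amplitude=0.01):
        """Return the noisified teacher value actually written into the
        output units; the clean y_teach_t is kept for the MSE (sketch)."""
        v = rng.uniform(-amplitude, amplitude, size=y_teach_t.shape)
        return y_teach_t + v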

[0191] Adding noise to the network dynamics is particularly helpful in signal generation and active signal processing tasks, where output-to-DR feedback connections are present. In such cases, the added noise randomly excites those internal units which have no stable, systematic dynamic relationship with the desired I/O behavior; as a consequence, weights from such “unreliable” units to the output units receive very small values from the learning procedure. The net effect is that the resulting trained network behaves more robustly (i.e., is less susceptible to perturbations). Adding noise was found to be indispensable in the “two-way device” example.

[0192] Adding noise is also beneficial in cases where the training data set is not much larger than the network size. In such cases, there is a danger of overfitting the training data, or stated in an alternative way: it is then difficult to achieve good generalization performance. Insertion of noise prevents the network from fitting to idiosyncrasies in the training data, thereby improving generalization. Adding noise to counteract overfitting was a necessity in the “pendulum control” example, where only a small part of the plant's control regime was visited during training, but a reasonably generalized performance was still achieved.

[0193] Further aspects of the invention are specific to the alternative cases of offline learning and online learning. Detailed descriptions follow of how the method of the invention works in the two cases.

[0194] Description of One Update Step for Data Collection in the Training Phase (Offline Case)

[0195] When the method of the invention is used for offline learning, the training data are presented to the network for t=1,2, . . . ,N, and the resulting network states during this period are recorded. After time N, these data are then used for the offline construction of MSE-minimizing weights to the output units. According to the method of the invention, the following substeps must be performed to achieve one complete update step (a condensed code sketch follows the substeps).

[0196] Input to Update Step t→t+1:

[0197] 1. DR units activation state x₁(t), . . . ,x_(K)(t)

[0198] 2. output units activation state y₁(t), . . . ,y_(m)(t) (identical to teacher signal {tilde over (y)}₁(t), . . . ,{tilde over (y)}_(m)(t))

[0199] 3. input signal u₁(t+1), . . . ,u_(n)(t+1) [unless the task is a pure signal generation task without input]

[0200] 4. teacher output {tilde over (y)}₁(t+1), . . . ,{tilde over (y)}_(m)(t+1)

[0201] Output After Update Step t→t+1:

[0202] 1. DR units activation state x₁(t+1), . . . ,x_(K)(t+1)

[0203] Side Effect of Update Step t→t+1:

[0204] 1. Network state vector x(t+1) and teacher output {tilde over (y)}₁(t+1), . . . ,{tilde over (y)}_(m)(t+1) are written to memory

[0205] Substeps:

[0206] 1. [unless the task is a pure signal generation task without input] Feed input u₁(t+1), . . . ,u_(n)(t+1) to the network, using the chosen input presentation method. When input is fed into the network by means of extra input units (the standard way), this means that the activations of the n input units are set to u₁(t+1), . . . ,u_(n)(t+1). The total network state is now u₁(t+1), . . . ,u_(n)(t+1),x₁(t), . . . ,x_(K)(t),y₁(t), . . . ,y_(m)(t) [in the case when input units are used; otherwise omit the first u₁(t+1), . . . ,u_(n)(t+1)].

[0207] 2. Update the state of the DR units, by applying the chosen update rule. For instance, when Eq. (1) is used, for every i=1, . . . ,K evaluate $x_{i}\left( {t + 1} \right) = f_{i}\left( w_{i1}u_{1}\left( {t + 1} \right) + \ldots + w_{in}u_{n}\left( {t + 1} \right) + w_{i,{n + 1}}x_{1}(t) + \ldots + w_{i,{n + K}}x_{K}(t) + w_{i,{n + K + 1}}y_{1}(t) + \ldots + w_{i,{n + K + m}}y_{m}(t) \right)$

[0208] 3. Write x(t+1)=u₁(t+1), . . . ,u_(n)(t+1),x₁(t+1), . . . ,x_(K)(t+1),y₁(t), . . . ,y_(m)(t) and {tilde over (y)}₁(t+1), . . . ,{tilde over (y)}_(m)(t+1) into a memory for later use in the offline computation of optimal weights. [In cases where the MSE to be minimized is of form (4*), write into memory x(t+1)=s·(u₁(t+1), . . . ,u_(n)(t+1),x₁(t+1), . . . ,x_(K)(t+1),y₁(t), . . . ,y_(m)(t))]

[0209] 4. Write the teacher signal {tilde over (y)}₁(t+1), . . . ,{tilde over (y)}_(m)(t+1) into the output units (teacher forcing), i.e., put y₁(t+1), . . . ,y_(m)(t+1)={tilde over (y)}₁(t+1), . . . ,{tilde over (y)}_(m)(t+1).
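
The substeps above can be condensed into a short loop. The following Python sketch (with assumed names and shapes, and tanh as a stand-in for the transfer functions) runs the teacher-forced network over the whole training period and records the state-teacher pairs needed for the offline weight computation described next.

    import numpy as np

    def collect_training_states(W, W_in, W_back, u, y_teach, f=np.tanh):
        """Teacher-forced run for t = 1..N, recording states and teacher
        outputs per substeps 1-4 above. Shapes (assumed): W is K x K,
        W_in is K x n, W_back is K x m, u is N x n, y_teach is N x m."""
        N = u.shape[0]
        K = W.shape[0]
        m = y_teach.shape[1]
        x = np.zeros(K)              # DR initialized to zero (par. [0185])
        y = np.zeros(m)              # output units
        states, targets = [], []
        for t in range(N - 1):
            u_next = u[t + 1]
            # substep 2: update the DR units (Eq. (1), tanh stand-in)
            x = f(W_in @ u_next + W @ x + W_back @ y)
            # substep 3: store (u(t+1), x(t+1), y(t)) and the teacher output
            states.append(np.concatenate([u_next, x, y]))
            targets.append(y_teach[t + 1])
            # substep 4: teacher forcing
            y = y_teach[t + 1]
        return np.array(states), np.array(targets)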

[0210] Description of the Optimal Weight Computation in the Offline Case

[0211] At time N, N state-teacher output pairs x(t), {tilde over (y)}₁(t), . . . ,{tilde over (y)}_(m)(t) have been collected in memory. The method of the invention proceeds now to compute weights w_(j,i) from all units which have entry 1 in the selection vector s to the output units. These weights are computed such that the chosen variant of the MSE (e.g., (4) or (4*)) is minimized. Technically, this is a linear regression task, for which many efficient methods are available. (Technical data analysis software packages, like MatLab, Mathematica, LinPack, or statistical data analysis packages, all contain highly refined linear regression procedures. For the production of the examples described in this document, the FIT procedure of Mathematica was used.) Because the particular way in which this linear regression is performed is not part of the invention, and because it will not present any difficulties to those practicing in the field, only the case where the MSE (4) is minimized is briefly treated here.

[0212] As a preparation, it is advisable to discard some initial state-teacher output pairs, to accommodate the fact that initial transients in the network should die out before data are used for training. After this, for each output unit j, consider the argument-value vector data set (x(t),f_(j)⁻¹({tilde over (y)}_(j)(t))), where t=t₀, . . . ,N. Compute linear regression weights for least mean square error regression of the values f_(j)⁻¹({tilde over (y)}_(j)(t)) on the arguments x(t), i.e., compute weights w_(j,i) such that the MSE Eq. (4) is minimized.

[0213] Write these weights into the network, which is now ready for exploitation.
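
A minimal sketch of this offline computation, assuming tanh output units (so that f⁻¹ = arctanh, which requires teacher values in (−1, 1)) and using a pseudoinverse-based least-squares routine in place of whatever regression procedure the practitioner prefers:

    import numpy as np

    def compute_output_weights(states, targets, f_inv=np.arctanh, discard=100):
        """Least-squares weights minimizing MSE Eq. (4). Assumes tanh output
        units, so f_inv = arctanh; any standard linear regression routine
        could be used instead of np.linalg.lstsq (sketch)."""
        X = states[discard:]          # drop initial transients (par. [0212])
        Y = f_inv(targets[discard:])  # regress f^{-1}(teacher) on the states
        W_out, *_ = np.linalg.lstsq(X, Y, rcond=None)
        return W_out.T                # one weight row per output unit

With the arrays produced by the collection sketch above, W_out = compute_output_weights(states, targets) yields the weights that are then written into the network.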

[0214] Description of One Update Step in the Exploitation Phase

[0215] When the trained network is exploited, input u₁(t), . . . ,u_(n)(t) is fed into it online [unless it is a pure signal generation device], and the network produces output y₁(t), . . . ,y_(m)(t) in an online manner. For convenience, a detailed description of an update step of the network during exploitation is given here.

[0216] Input to Update Step t→t+1:

[0217] 1. DR units activation state x₁(t), . . . ,x_(K)(t)

[0218] 2. output units activation state y₁(t), . . . ,y_(m)(t)

[0219] 3. input signal u₁(t+1), . . . ,u_(n)(t+1) [unless the task is a pure signal generation task without input]

[0220] Output After Update Step t→t+1:

[0221] 1. DR units activation state x₁(t+1), . . . ,x_(K)(t+1)

[0222] 2. output units activation state y₁(t+1), . . . ,y_(m)(t+1)

[0223] Substeps:

[0224] 1. [unless the task is a pure signal generation task without input] Feed input u₁(t+1), . . . ,u_(n)(t+1) to the network.

[0225] 2. Update the state of the DR units, by applying the chosen update rule. For instance, when Eq. (1) is used, for every i=1, . . . ,K evaluate $x_{i}\left( {t + 1} \right) = f_{i}\left( w_{i1}u_{1}\left( {t + 1} \right) + \ldots + w_{in}u_{n}\left( {t + 1} \right) + w_{i,{n + 1}}x_{1}(t) + \ldots + w_{i,{n + K}}x_{K}(t) + w_{i,{n + K + 1}}y_{1}(t) + \ldots + w_{i,{n + K + m}}y_{m}(t) \right)$

[0226] 3. Update the states of the output units, by applying the chosen update rule. For instance, when Eq. (1) is used, for every j=1, . . . ,m evaluate $y_{j}\left( {t + 1} \right) = f_{j}\left( w_{j1}u_{1}\left( {t + 1} \right) + \ldots + w_{jn}u_{n}\left( {t + 1} \right) + w_{j,{n + 1}}x_{1}\left( {t + 1} \right) + \ldots + w_{j,{n + K}}x_{K}\left( {t + 1} \right) + w_{j,{n + K + 1}}y_{1}(t) + \ldots + w_{j,{n + K + m}}y_{m}(t) \right)$

[0227] The important part to note here is the “cascaded” update: first the DR units are updated in substep 2, then the output units are updated in substep 3. This corresponds to a similarly “cascaded” update in the training phase.
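
The cascaded order is perhaps easiest to see in code. This sketch of one exploitation step (assumed names; tanh stand-ins for the transfer functions) updates the DR from the old output y(t) first, and only then updates the output units from the new DR state:

    import numpy as np

    def exploit_step(x, y, u_next, W, W_in, W_back, W_out,
                     f=np.tanh, f_out=np.tanh):
        """One exploitation update step t -> t+1 (sketch)."""
        x_next = f(W_in @ u_next + W @ x + W_back @ y)   # substep 2: DR first
        state = np.concatenate([u_next, x_next, y])      # (u(t+1), x(t+1), y(t))
        y_next = f_out(W_out @ state)                    # substep 3: outputs second
        return x_next, y_next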

[0228] Variations

[0229] In updating recurrent neural networks with extra input and output units, there is some degree of freedom in the particular relative update order of the various types of units (input, DR, output). For instance, instead of the particular “cascaded” update described above, in alternative embodiments the DR units and output units can be updated simultaneously, resulting in slightly (but typically not significantly) different network behavior. In yet other alternative embodiments, where the DR is endowed with a modular or layered substructure, more complex update regulations may be required, updating particular regions of the network in a particular order. The important thing to observe for the method of the invention is that whichever update scheme is used, the same scheme must be used in the training and in the exploitation phase.

[0230] Description of One LMS Update Step for Online Adaptation

[0231] In contrast to the offline variants of the method, online adaptation methods can be used both for minimizing the output state error (MSE criteria (4), (4*)) and for minimizing the output value error (MSE criteria (4′), (4′*)).

[0232] In online adaptation, the weights w_(j,i) to the j-th output unit are incrementally optimized at every time step, thereby becoming time-dependent variables w_(j,i)(t) themselves. A host of well-known methods for online MSE-minimizing adaptation can be used for the method of the invention, for instance stochastic gradient descent methods like the LMS method or Newton's method (or combinations thereof), or so-called “deterministic” methods like the RLS method.

[0233] Among these, the LMS method is by far the simplest. It is not optimally suited for the method of the invention (the reasons for this have been indicated in the discussion of the Lorenz attractor example). Nonetheless, owing to its simplicity, LMS is the best choice for a didactical illustration of the principles of the online version of the method of the invention.

[0234] Here is a description of one update step, using the LMS method to optimize weights (a condensed code sketch follows the substeps).

[0235] Input to Update Step t→t+1:

[0236] 1. DR units activation state x₁(t), . . . ,x_(K)(t)

[0237] 2. output units activation state y₁(t), . . . ,y_(m)(t)

[0238] 3. input signal u₁(t+1), . . . ,u_(n)(t+1) [unless the task is a pure signal generation task without input]

[0239] 4. teacher output {tilde over (y)}₁(t+1), . . . ,{tilde over (y)}_(m)(t+1)

[0240] 5. weights w_(j,i)(t) of connections to the output units

[0241] Output After Update Step t→t+1:

[0242] 1. DR units activation state x₁(t+1), . . . ,x_(K)(t+1)

[0243] 2. output units activation state y₁(t+1), . . . ,y_(m)(t+1)

[0244] 3. new weights w_(j,i)(t+1)

[0245] Substeps:

[0246] 1. [unless the task is a pure signal generation task without input] Feed input u₁(t+1), . . . ,u_(n)(t+1) to the network.

[0247] 2. Update the DR units, by applying the chosen update rule. For instance, when Eq. (1) is used, for every i=1, . . . ,K evaluate $x_{i}\left( {t + 1} \right) = f_{i}\left( w_{i1}u_{1}\left( {t + 1} \right) + \ldots + w_{in}u_{n}\left( {t + 1} \right) + w_{i,{n + 1}}x_{1}(t) + \ldots + w_{i,{n + K}}x_{K}(t) + w_{i,{n + K + 1}}y_{1}(t) + \ldots + w_{i,{n + K + m}}y_{m}(t) \right)$

[0248] 3. Update the states of the output units, by applying the chosen update rule. For instance, when Eq. (1) is used, for every j=1, . . . ,m evaluate $y_{j}\left( {t + 1} \right) = f_{j}\left( w_{j1}u_{1}\left( {t + 1} \right) + \ldots + w_{jn}u_{n}\left( {t + 1} \right) + w_{j,{n + 1}}x_{1}\left( {t + 1} \right) + \ldots + w_{j,{n + K}}x_{K}\left( {t + 1} \right) + w_{j,{n + K + 1}}y_{1}(t) + \ldots + w_{j,{n + K + m}}y_{m}(t) \right)$

[0249] 4. For every output unit j=1, . . . ,m, update the weights w_(j)(t)=(w_(j,1)(t), . . . ,w_(j,n+K+m)(t)) to w_(j)(t+1), according to the adaptation method chosen. Here the LMS method is described as an example. It comprises the following substeps:

[0250] a. Compute the error ε_(j)(t+1)={tilde over (y)}_(j)(t+1)−y_(j)(t+1). [Note: this yields an output value error, and consequently, the MSE of Eq. (4′) will be minimized. In order to minimize the output state error, use ε_(j)(t+1)=f_(j)⁻¹({tilde over (y)}_(j)(t+1))−f_(j)⁻¹(y_(j)(t+1)) instead.]

[0251] b. Put w_(j)(t+1)=w_(j)(t)+με_(j)(t+1)x(t), where μ is a learning rate and x(t) is the total network state (including input and output units) obtained after step 3.

[0252] 5. If there are output-to-DR feedback connections, write the teacher signal {tilde over (y)}₁(t+1), . . . ,{tilde over (y)}_(m)(t+1) into the output units (teacher forcing), i.e., put y₁(t+1), . . . ,y_(m)(t+1)={tilde over (y)}₁(t+1), . . . ,{tilde over (y)}_(m)(t+1).
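
Substeps 1 through 5 can be gathered into one function. The following LMS sketch uses the output value error of Eq. (4′) and tanh stand-ins for the transfer functions; all names are assumptions, and the state vector used in the weight update is taken before the outputs are overwritten by teacher forcing.

    import numpy as np

    def lms_step(x, y, u_next, y_teach_next, W, W_in, W_back, W_out,
                 mu=0.01, f=np.tanh, f_out=np.tanh):
        """One online LMS update step t -> t+1 (sketch)."""
        x_next = f(W_in @ u_next + W @ x + W_back @ y)   # substep 2
        state = np.concatenate([u_next, x_next, y])      # total network state
        y_next = f_out(W_out @ state)                    # substep 3
        err = y_teach_next - y_next                      # substep 4a: value error
        W_out = W_out + mu * np.outer(err, state)        # substep 4b: LMS update
        y_next = np.array(y_teach_next, copy=True)       # substep 5: teacher forcing
        return x_next, y_next, W_out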

[0253] As in the offline version of the method of the invention, many trivial variations of this update scheme exist, distinguished from each other, e.g., by the update equation (which version of Eq. (4) is used), by the particular order in which parts of the network are updated in a cascaded fashion, by the specific method in which input is administered, etc. These variations are not consequential for the method of the invention; the above detailed scheme of an update step is only an illustration of one possibility.

CLAIMS

1. A method for constructing a discrete-time recurrent neural network and training it in order to minimize its output error, comprising the steps of: a. constructing a recurrent neural network as a reservoir for excitable dynamics (DR network); b. providing means of feeding input to the DR network; c. attaching output units to the DR network through weighted connections; d. training the weights of the connections from the DR network to the output units in a supervised training scheme.

2. The method of claim 1, wherein the DR network has a large number of units (greater than 50).

3. The method of claim 1 or 2, wherein the DR network is sparsely connected.

4. The method of any one of claims 1 to 3, wherein the connections within the DR network have randomly assigned weights.

5. The method of any one of claims 1 to 4, wherein different update rules or differently parameterized update rules are used for different DR units.

6. The method of any one of claims 1 to 5, wherein a spatial structure is imprinted on the DR network through the connectivity pattern.

7. The method of claim 6, wherein the spatial structure is a regular grid.

8. The method of claim 6, wherein the spatial structure is a local neighborhood structure (induced by a banded or subbanded structure of the connectivity matrix).

9. The method of claim 6, wherein the spatial structure is modular or organized in levels.

10. The method of any one of claims 1 to 9, wherein the weights within the DR are globally scaled such that the resulting dynamics of the isolated DR network is globally stable.

11. The method of any one of claims 1 to 9, wherein the weights within the DR are globally scaled such that the resulting dynamics of the isolated DR network is marginally globally stable, in order to achieve a long duration of memory effects in the final network after training.

12. The method of claim 10 or 11, wherein input is fed to the DR by means of extra input units.

13. The method of claim 12, wherein the connections from the input units to the DR are sparse.

14. The method of claim 12 or 13, wherein the weights of connections from the input units to the DR are randomly fixed and have negative and positive signs.

15. The method of any one of claims 12 to 14, wherein the weights of connections from the input units to the DR are globally scaled to small absolute values in order to achieve a long duration of memory effects in the final network I/O characteristics, or in order to achieve slow or low-pass time characteristics in the final network I/O characteristics, or in order to achieve nearly linear I/O characteristics.

16. The method of any one of claims 12 to 14, wherein the weights of connections from the input units to the DR are globally scaled to large absolute values in order to achieve a short duration of memory effects, or in order to achieve fast I/O behavior, or in order to achieve highly nonlinear or “switching” characteristics in the final trained network.

17. The method of claim 10 or 11, wherein input is fed to the DR by means other than extra input units.

18. The method of any one of claims 1 to 17, wherein extra output units are attached to the DR without feedback connections from the output units to the DR, in order to obtain a passive signal processing network after training.

19. The method of any one of claims 1 to 17, wherein extra output units are attached to the DR with feedback connections from the output units to the DR, in order to obtain an active signal processing or signal generation network after training.

20. The method of claim 19, wherein the feedback connections are sparse.

21. The method of claim 19 or 20, wherein the weights of the feedback connections are randomly fixed and have negative and positive signs.

22. The method of any one of claims 19 to 21, wherein the weights of the feedback connections are globally scaled to small absolute values in order to achieve a long duration of memory effects in the final network I/O characteristics, or in order to achieve slow or low-pass time characteristics in the final network I/O characteristics, or in order to achieve linear I/O characteristics.

23. The method of any one of claims 19 to 21, wherein the weights of the feedback connections are globally scaled to large absolute values in order to achieve a short duration of memory effects, or in order to achieve fast I/O behavior, or in order to achieve highly nonlinear or “switching” characteristics in the final trained network.

24. The method of any one of claims 1 to 23, wherein the network is trained in an offline version of supervised teaching.

25. The method of claim 24, wherein the task to be learnt is a signal generation task, no input exists, and the teacher signal consists only of a sample of the desired output signal.

26. The method of claim 24, wherein the task to be learnt is a signal processing task, where input exists, and where the teacher signal consists of a sample of the desired input/output pairing.

27. The method of any one of claims 24 to 26, wherein output-error-minimizing weights of the connections to the output nodes are computed, comprising the steps of: a. presenting the teacher signals to the network and running the network in teacher-forced mode for the duration of the teaching period, b. saving into a memory the network states and the signals f_(j)⁻¹({tilde over (y)}_(j)(t)) obtained by mapping the inverse of the output unit's transfer function on the teacher output, c. optionally discarding initial state/output pairs in order to accommodate initial transient effects, d. computing the weights of the connections to the output nodes by a standard linear regression method.

28. The method of any one of claims 24 to 27, wherein during the training period noise is inserted into the network dynamics, by utilizing a noisy update rule and/or by adding noise to the input and/or (if output-to-DR feedback connections exist) by adding a noise component to the teacher output before it is fed back into the DR.

29. The method of any one of claims 24 to 28, wherein weights of connections from only a subset of the network's units (i.e., a subset of the input, DR, and output units) to the output units are trained, and the other ones are set to zero.

30. The method of any one of claims 1 to 23, wherein the network is trained in an online version of supervised teaching.

31. The method of claim 30, wherein the task to be learnt is a signal generation task, no input exists, and the teacher signal consists only of a sample of the desired output signal.

32. The method of claim 30, wherein the task to be learnt is a signal processing task, where input exists, and where the teacher signal consists of a sample of the desired input/output pairing.

33. The method of any one of claims 30 to 32, wherein output-error-minimizing weights of the connections to the output nodes are updated at every time step, the update comprising the substeps of: a. feeding the input to the network and updating the network, b. for every output unit, computing an error as the difference between the desired teacher output and the actual network output (output value error); or, alternatively, as the difference between the value f_(j)⁻¹({tilde over (y)}_(j)(t)) obtained by mapping the inverse of the output unit's transfer function on the teacher output, and the value obtained by mapping the inverse of the output unit's transfer function on the actual output (output state error), c. updating the weights of the connections to the output nodes by a standard method for minimizing the error computed in the previous substep b., d. in cases of signal generation tasks or active signal processing tasks, forcing the teacher output into the output units.

34. The method of any one of claims 30 to 33, wherein noise is inserted into the network dynamics, by utilizing a noisy update rule or (optionally, if feedback connections exist) by adding a noise component to the teacher output before it is fed back into the DR.

35. The method of any one of claims 30 to 34, wherein weights of connections from only a subset of the network's units (i.e., a subset of the input, DR, and output units) to the output units are trained, and the other ones are set to zero.

36. The method of any one of claims 1 to 35, wherein the network is trained on two or more output units with feedback connections to the DR, which in the exploitation phase are utilized in any chosen “direction”, by treating some of the trained units as input units and the remaining ones as output units. (This realizes the learning of dynamical relationships between signals.)

37. The method of claim 36 applied to tasks of reconstructive memory of multidimensional dynamical patterns, comprising: a. training the network with teaching signals consisting of complete-dimensional samples of the patterns, b. in the exploitation phase, presenting cue patterns which are incompletely given in only some of the dimensions as input in those dimensions, and reading out the completed dynamical patterns on the remaining units.

38. The method of any one of claims 1 to 35, applied to tasks of closed-loop (state or observation feedback) tracking control of a plant, comprising: a. using training samples consisting of two kinds of input signals to the network, namely, (i) a future version of the variables that will serve as a reference signal in the exploitation phase, and (ii) plant output observation (or plant state observation); and consisting further of a desired network output signal, namely, (iii) plant control input, b. training a network using the teacher input and output signals from a., in order to obtain a network which computes as network output a plant control input (i.e., (iii)), depending on the current plant output observation (i.e., (ii)) and a future version of reference variables (i.e., (i)), c. exploiting the network as a closed-loop controller by feeding it with the inputs (i) future reference signals and (ii) current plant output observation (or plant state observation), and letting the network generate the current plant control input.

39. A neural network constructed and trained according to any one of the preceding claims.

40. A neural network according to claim 39, wherein it is implemented as a microcircuit.

41. A neural network according to claim 39, wherein it is implemented by a suitably programmed computer.